1. Load the Libraries and the Data

library(ThresholdR)
library(Seurat)
library(dplyr)
set.seed(42)
# we will install the 'bmcite' dataset from SeuratData::AvailableData()
#SeuratData::InstallData('bmcite')
#then load the data
#so <- SeuratData::LoadData('bmcite')

# Here we are loading the V4 seurat object of bmcite dataset: 
so <- readRDS('bmcite_so.RDS')

2. Process the Seurat object

so <- NormalizeData(so, assay = "ADT",normalization.method="CLR", margin=2)
## Normalizing across cells
# For Seurat V4 object:
adt <- as.data.frame(t(as.matrix(so@assays$ADT@data)))

# For Seurat V5 object:
#adt <- as.data.frame(t(as.matrix(LayerData(so, assay = "ADT",layer='data'))))

3. Running ThresholdR

3.1 BIC value calculation

We assume that the maximum number of underlying populations (components) across all ADT markers is equal to 3 (trimodal). Then we will calculate BIC values which indicate how well a model is fitted. More positive BIC value (calculated using ‘mclust’ R package) indicates a better fitting:

Sys.time()
## [1] "2025-03-16 12:00:19 EDT"
bic.values <- calculateBIC(data = adt, k = 3)
## [1] "CD11a"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD11c"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD123"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD127-IL7Ra"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD14"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD16"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD161"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD19"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD197-CCR7"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD25"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD27"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD278-ICOS"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD28"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD3"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD34"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD38"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD4"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD45RA"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD45RO"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD56"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD57"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD69"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD79b"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD8a"
## [1] 1
## [1] 2
## [1] 3
## [1] "HLA.DR"
## [1] 1
## [1] 2
## [1] 3

3.2 Acquire the possible numbers of underlying components for each marker

To choose the possible fits (k=2 and/or k=3) for each ADT, the user should assign a set of few markers which they believe to have bimodal or trimodal (non-unimodal) distributions. With this input, a margin is defined as the minimum value of BIC2/BIC1 divided by 3 across the set of user-provided markers. If BIC(k+1) is greater BIC(k) by at least that calculated margin in any ADT marker, k+1 is also included in the list of possible k number of components for the corresponding marker.

k.list <- get_k_list(bic_values = bic.values, markers = c("CD3", "CD14", "CD19", "CD45RA"))
## [1] "CD11a"
## [1] "CD11c"
## [1] "CD123"
## [1] "CD127-IL7Ra"
## [1] "CD14"
## [1] "CD16"
## [1] "CD161"
## [1] "CD19"
## [1] "CD197-CCR7"
## [1] "CD25"
## [1] "CD27"
## [1] "CD278-ICOS"
## [1] "CD28"
## [1] "CD3"
## [1] "CD34"
## [1] "CD38"
## [1] "CD4"
## [1] "CD45RA"
## [1] "CD45RO"
## [1] "CD56"
## [1] "CD57"
## [1] "CD69"
## [1] "CD79b"
## [1] "CD8a"
## [1] "HLA.DR"
# If the user wishes to include k=2 and k=3 for all markers in the data matrix, they could modify k_list as the following:
# k.list <- list()
# for (i in names(adt)) {
#   k.list[[i]] <- c(2,3)
# }

In order to see if a marker is not meeting the threshold for any k, and is dropped out of the k_list, we can have a look:

names(bic.values)[!names(bic.values) %in% names(k.list)]
## character(0)

With the estimated list of possible k(s) for markers, we can run the fitting algorithm:

fittings <- fit_k(data = adt, k_list = k.list, margin.den = 0.1, epsilon = 0.01, maxit = 1000, maxrestarts = 10)
## CD11a: 2 modal: number of iterations= 12 
## CD11c: 2 modal: number of iterations= 20 
## CD11c: 3 modal: number of iterations= 79 
## CD123: 2 modal: number of iterations= 30 
## CD123: 3 modal: number of iterations= 44 
## CD127-IL7Ra: 2 modal: number of iterations= 15 
## CD14: 2 modal: number of iterations= 13 
## CD14: 3 modal: number of iterations= 80 
## CD16: 2 modal: number of iterations= 13 
## CD16: 3 modal: number of iterations= 111 
## CD161: 2 modal: number of iterations= 14 
## CD19: 2 modal: number of iterations= 16 
## CD19: 3 modal: number of iterations= 106 
## CD197-CCR7: 2 modal: number of iterations= 19 
## CD25: 2 modal: number of iterations= 30 
## CD27: 2 modal: number of iterations= 18 
## CD27: 3 modal: number of iterations= 49 
## CD278-ICOS: 2 modal: number of iterations= 19 
## CD278-ICOS: 3 modal: number of iterations= 79 
## CD28: 2 modal: number of iterations= 14 
## CD3: 2 modal: number of iterations= 7 
## CD34: 2 modal: number of iterations= 20 
## CD34: 3 modal: number of iterations= 115 
## CD38: 2 modal: number of iterations= 46 
## CD4: 2 modal: number of iterations= 13 
## CD4: 3 modal: number of iterations= 32 
## CD45RA: 2 modal: number of iterations= 25 
## CD45RO: 2 modal: number of iterations= 19 
## CD45RO: 3 modal: number of iterations= 166 
## CD56: 2 modal: number of iterations= 12 
## CD57: 2 modal: number of iterations= 17 
## CD69: 2 modal: number of iterations= 16 
## CD69: 3 modal: number of iterations= 127 
## CD79b: 2 modal: number of iterations= 9 
## CD8a: 2 modal: number of iterations= 17 
## CD8a: 3 modal: number of iterations= 94 
## HLA.DR: 2 modal: number of iterations= 28 
## Fixing Sigma for C1 in CD11a -Bimodal# Fixing Sigma for C1 in CD11c -Bimodal# Fixing Sigma for C1 in CD11c -Trimodal# Fixing Sigma for C1 in CD127-IL7Ra -Bimodal# Fixing Sigma for C1 in CD27 -Bimodal# Fixing Sigma for C1 in CD27 -Trimodal# Fixing Sigma for C1 in CD28 -Bimodal# Fixing Sigma for C1 in CD3 -Bimodal# Fixing Sigma for C1 in CD4 -Bimodal# Fixing Sigma for C1 in CD4 -Trimodal# Fixing Sigma for C1 in CD45RA -Bimodal#

3.3 Calculate the thresholds

We then proceed to calculate different thresholds in bimodal and trimodal fittings:

thresholds <- get_thresholds(fittings = fittings, k_list = k.list)

3.4 Publish the bimodal and trimodal fittings

publish_fittings_thr(fittings = fittings, k_list = k.list, thresholds = thresholds, path = '02Plots/', seed = 42)
Sys.time()
## [1] "2025-03-16 12:00:44 EDT"

The ThresholdR process is completed within 1 minute when running on a MacBook Pro M4 with 16 cores or 2-3 minutes in a linux environment with 1 CPU.

4. Application of the thresholds

Save the threshold values for all markers:

# write.csv(thresholds, '03AllThresholds.csv')

We advise users to choose between suggested thresholds in the ‘03AllThresholds.csv’ file and make the ‘04All_Markers.csv’ file with the extra columns of ‘threshold_col’ and ‘threshold_value’. Then we can threshold the ADT data. This is done through subtracting the final threshold values stored in the ‘threshold_value’ column from each marker expression level in each cell. At the end, the expression levels below the thresholds are set to zero.

markers.thr <- read.csv('04All_Markers.csv')
thAb <- adt[,names(fittings)]
for (i in colnames(thAb)) {
  thAb[,i] <- thAb[,i]-markers.thr[markers.thr$marker_name==i,"threshold_value"]
  thAb[thAb[,i]<0,i] <- 0
}

# if you do not wish to select a threshold out of the suggested ones, you can use the bi_thr column value across the markers that have k=2 in their set of possible k(s) in within k_list:
# thAb <- adt[,names(fittings)]
# for (i in colnames(thAb)) {
#   thAb[,i] <- thAb[,i]-markers.thr[markers.thr$marker_name==i,"bi_thr"]
#   thAb[thAb[,i]<0,i] <- 0
# }

5. Cell-Type Annotation:

To annotate the 7 major cell types of B, CD4T, CD8T, classical (cMo), intermediate (iMo), and non-classical (nMo) monocytes, and NK cells, we use the thresholded values of the following set of main markers: CD3, CD4, CD8 (CD8a), CD14, CD16, CD19, and CD56:

so[["thAb"]] <- CreateAssayObject(data=t(as.matrix(thAb)))
DefaultAssay(so) <- "thAb"
so$annotation <- ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD4==0 & CD8a==0 & CD14==0 & CD16==0 & CD19>0 & CD56==0), "B",
                        ifelse(colnames(so) %in% WhichCells(so, expression = CD3>0 & CD4>0 & CD8a==0 & CD14==0 & CD16==0 & CD19==0 & CD56==0), "CD4T",
                               ifelse(colnames(so) %in% WhichCells(so, expression = CD3>0 & CD4==0 & CD8a>0 & CD14==0 & CD19==0 & CD56==0), "CD8T",
                                      ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD14>0 & CD16==0 & CD19==0 & CD56==0), "cMo",
                                             ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD14>0 & CD16>0 & CD19==0 & CD56==0), "iMo",
                                                    ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD14==0 & CD16>0 & CD19==0 & CD56==0), "nMo",
                                                           ifelse(colnames(so) %in% union(WhichCells(so, expression = CD3==0 & CD14==0 & CD16==0 & CD19==0 & CD56>0),
                                                                                          WhichCells(so, expression = CD3==0 & CD14==0 & CD16>0 & CD19==0 & CD56>0)), "NK",
                                                                  "Rem")))))))
table(so$annotation)
## 
##    B CD4T CD8T  cMo  iMo   NK  nMo  Rem 
## 3584 7930 5909 6457  175 1650  365 4602

6. Visualizing the annotated cell types

Here we propose using of ADT assay for the purpose of dimension reduction:

DefaultAssay(so) <- "ADT"
so <- so %>%
  ScaleData(features = rownames(so)) %>%
  RunPCA(features = rownames(so))
## Centering and scaling data matrix
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## PC_ 1 
## Positive:  CD3, CD28, CD27, CD127-IL7Ra, CD278-ICOS, CD4, CD8a, CD161, CD25, CD45RO 
##     CD45RA, CD197-CCR7 
## Negative:  HLA.DR, CD11c, CD14, CD38, CD69, CD123, CD11a, CD34, CD19, CD79b 
##     CD16, CD56 
## PC_ 2 
## Positive:  CD45RO, CD11a, CD14, CD4, CD11c, CD28, CD3, CD69, CD278-ICOS, CD127-IL7Ra 
##     CD38, CD161 
## Negative:  CD45RA, CD19, CD79b, CD197-CCR7, CD8a, CD57, CD56, CD16, CD27, CD25 
##     HLA.DR, CD34 
## PC_ 3 
## Positive:  CD16, CD56, CD8a, CD11a, CD57, CD161, CD45RA, CD38, CD3, CD127-IL7Ra 
##     CD27, CD11c 
## Negative:  CD19, CD79b, CD4, CD25, HLA.DR, CD197-CCR7, CD69, CD34, CD28, CD45RO 
##     CD14, CD278-ICOS 
## PC_ 4 
## Positive:  CD161, CD25, CD45RO, CD56, CD16, CD4, CD79b, CD19, CD28, CD57 
##     CD127-IL7Ra, HLA.DR 
## Negative:  CD8a, CD27, CD14, CD45RA, CD3, CD11c, CD69, CD38, CD197-CCR7, CD11a 
##     CD34, CD123 
## PC_ 5 
## Positive:  CD4, CD38, CD123, CD16, CD34, CD45RA, CD56, CD278-ICOS, CD28, CD197-CCR7 
##     CD27, CD3 
## Negative:  CD45RO, CD8a, CD69, CD161, CD79b, CD19, CD25, CD57, CD14, CD11a 
##     CD127-IL7Ra, HLA.DR
ElbowPlot(so,ndims = 24)

Choosing the first 20 components from PCA:

so <- so %>%
  FindNeighbors(dims = 1:20)%>%
  FindClusters(resolution = 0.3, random.seed = 42) %>% 
  RunUMAP(dims = 1:20, spread = 1.5, min.dist = 0.5, seed.use = 42)
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 30672
## Number of edges: 1069725
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9569
## Number of communities: 15
## Elapsed time: 3 seconds
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 12:00:51 UMAP embedding parameters a = 0.3593 b = 1.149
## 12:00:51 Read 30672 rows and found 20 numeric columns
## 12:00:51 Using Annoy for neighbor search, n_neighbors = 30
## 12:00:51 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 12:00:52 Writing NN index file to temp file /var/folders/5_/mvpjsxr157b_s5l6r66v67fh0000gp/T//RtmppzsaAJ/filef341664a3ba0
## 12:00:52 Searching Annoy index using 1 thread, search_k = 3000
## 12:00:56 Annoy recall = 100%
## 12:00:56 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 12:00:57 Initializing from normalized Laplacian + noise (using irlba)
## 12:00:58 Commencing optimization for 200 epochs, with 1349406 positive edges
## 12:01:05 Optimization finished
DimPlot(so, group.by = 'annotation')

Since we have not included CD34 in our main markers, the (CD34+) myeloid progenitors are annotated as the remaining (Rem) cell types. We will plot the denoised expression levels of the 7 main markers plus CD34 next:

library(ggplot2)
library(patchwork)
for (i in c("CD3","CD4","CD8a","CD14","CD16","CD19", "CD56", "CD34")) {
  p.noisy <- FeaturePlot(so, features = paste0("adt_",i), slot = "data")+
    ggtitle(label = paste("Noisy", i))
  p.denoised <- FeaturePlot(so, features = paste0("thab_",i), slot = "data")+
        ggtitle(label = paste("Denoised", i))

  print(wrap_plots(p.noisy, p.denoised, ncol = 2))
}

To see how ThresholdR changed the noise across the major cell types, we exclude the cells with Rem label, and then compare the two dotplots:

DefaultAssay(so) <- "ADT"
print(DotPlot(object = subset(so, cells = WhichCells(so, expression = annotation!="Rem")), features = c("CD3","CD4","CD8a","CD14","CD16","CD19", "CD56"), group.by = 'annotation', dot.min = 0.05, scale = F)+
         ggtitle(label = paste("Noisy ADT")))

DefaultAssay(so) <- "thAb"
print(DotPlot(object = subset(so, cells = WhichCells(so, expression = annotation!="Rem")), features = c("CD3","CD4","CD8a","CD14","CD16","CD19", "CD56"), group.by = 'annotation', dot.min = 0.05, scale = F)+
        ggtitle(label = paste("Denoised thAb")))
## Warning: Removed 33 rows containing missing values or values outside the scale range
## (`geom_point()`).